6  Data Wrangling for Data Visualization

Author

Aaron R. Williams

library(tidyverse)

6.1 Import Packages and Functions

  • filter() - drop cases based on logical conditions
  • mutate() - add new variables are transform existing variables
  • pivot_wider() - make data wider
  • pivot_longer() - make data longer
  • separate() - split a character variable into multiple columns based on a delimiter

6.2 Tidy Data

This data comes from library(tidyr) and these examples largely come from chapter 12 in R for Data Science by Hadley Wickham and Garrett Grolemund.

table1, table2, table3, table4a, table4b, and table5 all display the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000.

The following table, table1 is tidy because

  1. Each variable has its own column
  2. Each observation has its own row
  3. Each value has its own cell
table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

6.3 Untidy table2

Why aren’t the data tidy?

table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Each observation was spread across two rows!

table2 %>%
  pivot_wider(names_from = type, values_from = count)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

6.4 Untidy table3

Why aren’t the data tidy?

table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

The rate column had two variables!

table3 %>%
  separate(rate, into = c("cases", "population")) %>%
  mutate(
    cases = as.numeric(cases),
    population = as.numeric(population)
  )
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

6.5 Untidy table4

Why aren’t the data tidy?

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
table4b
# A tibble: 3 × 3
  country         `1999`     `2000`
  <chr>            <dbl>      <dbl>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

Each table had variables in the column names and the data were spread across two tables

# fix table4a
table4a_fixed <- table4a %>%
  pivot_longer(
    cols = c(`1999`, `2000`), 
    names_to = "year", 
    values_to = "cases"
  )

# fix table4b
table4b_fixed <- table4b %>%
  pivot_longer(
    cols = c(`1999`, `2000`), 
    names_to = "year", 
    values_to = "population"
  )

# join the two tables into one tidy table
left_join(
  table4a_fixed, 
  table4b_fixed, 
  by = c("country", "year")
)
# A tibble: 6 × 4
  country     year   cases population
  <chr>       <chr>  <dbl>      <dbl>
1 Afghanistan 1999     745   19987071
2 Afghanistan 2000    2666   20595360
3 Brazil      1999   37737  172006362
4 Brazil      2000   80488  174504898
5 China       1999  212258 1272915272
6 China       2000  213766 1280428583

6.6 Titanic Data

Consider the Titanic data set from the 3D plots example.

tribble(
  ~Class, ~Sex, ~n,
  "1st class", "female passengers", 144,
  "1st class", "male passengers", 179,
  "2nd class", "female passengers", 106,
  "2nd class", "male passengers", 171, 
  "3rd class", "female passengers", 216,
  "3rd class", "male passengers", 493
)
Exercise 1

Tidy the following data:

tribble(
  ~Class, ~female_passengers, ~male_passengers,
  "1st class", 144, 179,
  "2nd class", 106, 171,
  "3rd class", 216, 493
)
Exercise 2

Tidy the following data:

tribble(
  ~Class, ~`male passengers/female passengers`,
  "1st class", "179|144",
  "2nd class", "171|106",
  "3rd class", "493|216",
)
Exercise 3

Suppose we want to create a column chart with only female passengers. Furthermore, we want n on the x-axis and we want Class on the y-axis with the levels going from 1st to 3rd from top to bottom.

Step 1: Use filter() to drop male passengers.

Step 2: Create a column chart with n on the x-axis and Class on the y-axis. Pipe (%>%) the data into ggplot().

Step 3: After the filter(), add the following:

  mutate(Class = factor(Class, levels = c("3rd class", "2nd class", "1st class"))) %>%

6.7 Additional tools

  • library(stringr) has powerful tools for dealing with text strings.
  • library(lubridate) has powerful tools for dealing with time and dates.